Course: Visual Analytics for Policy and Management

Prof. José Manuel Magallanes, PhD


Part 2: Visualizing Tabular data

Univariate Case


Contents:

  1. Intro.

  2. Data Types.

  3. Data Processes.

    3.1 Classification.

    3.2 Counting.

    3.3 Measurement.


Most data are commonly organized in tabular format, that is, tables. When data is in tabular format, cases are organized in rows, while variables (information about the cases) are organized in columns. Almost every data you have used in a spreadsheet follows that structure.

For example, when you visit the website of the Common Core of Data from the US Department of Education, you can get a data set with detailed information on public schools at the state of Washington. Let me get a data table I have based on that:

link='https://github.com/EvansDataScience/VisualAnalytics_2_tabularData/raw/master/data/eduwa.rda'
load(file=url(link))

When you are in RStudio, you can view the data table by clicking on its name at the Environment .

It also good to know how much info you have:

#number of rows and columns
dim(eduwa) #nrow(eduwa) ncol(eduwa)
## [1] 2427   24

This is the list of the 24 columns:

names(eduwa)
##  [1] "NCES.School.ID"        "State.School.ID"      
##  [3] "NCES.District.ID"      "State.District.ID"    
##  [5] "Low.Grade"             "High.Grade"           
##  [7] "School.Name"           "District"             
##  [9] "County"                "Street.Address"       
## [11] "City"                  "State"                
## [13] "ZIP"                   "ZIP.4-digit"          
## [15] "Phone"                 "Locale.Code"          
## [17] "LocaleType"            "LocaleSub"            
## [19] "Charter"               "Title.I.School"       
## [21] "Title.1.School.Wide"   "Student.Teacher.Ratio"
## [23] "Free.Lunch"            "Reduced.Lunch"

When dealing with tabular data, you can suspect that you can produce a visualization for each column, and then for a couple of them simultaneously, and then for three or more.

In this material, we will pay attention to the univariate case; which is common for searching problems or veryfing outcomes; not for giving explanations. Then, when dealing with univariate data, you need to be aware of two things: what question you are trying to answer; and how to treat a particular variable to build the plot that will answer that question.

Go to table of contents.

Data Types

I can not anticipate all the questions you can try to answer via plots; but I can tell you that you are always limited by the nature of the variables you have at hand. Generally speaking, you have either categorical or numerical data in each column, and whatever question you have, you first need to know how that variable you are planing to use has been encoded, so you can plan the treatment. In R, we can know that like this:

# this 'width = 70,strict.width='cut' means
# you do not want to see more than 70 characters per row.

str(eduwa,width = 70,strict.width='cut')
## 'data.frame':    2427 obs. of  24 variables:
##  $ NCES.School.ID       : chr  "530486002475" "530270001270" "53091"..
##  $ State.School.ID      : chr  "WA-31025-1656" "WA-06114-1646" "WA-"..
##  $ NCES.District.ID     : chr  "5304860" "5302700" "5309100" "53000"..
##  $ State.District.ID    : chr  "WA-31025" "WA-06114" "WA-34033" "WA"..
##  $ Low.Grade            : Ord.factor w/ 14 levels "PK"<"KG"<"1"<..: ..
##  $ High.Grade           : Ord.factor w/ 15 levels "PK"<"KG"<"1"<..: ..
##  $ School.Name          : chr  "10th Street School" "49th Street Ac"..
##  $ District             : chr  "Marysville School District" "Evergr"..
##  $ County               : chr  "Snohomish" "Clark" "Thurston" "Gray"..
##  $ Street.Address       : chr  "7204 27th Ave NE" "14619B NE 49th S"..
##  $ City                 : chr  "Marysville" "Vancouver" "Tumwater" "..
##  $ State                : chr  "WA" "WA" "WA" "WA" ...
##  $ ZIP                  : chr  "98271" "98682" "98512" "98520" ...
##  $ ZIP.4-digit          : chr  NA "6308" NA "5510" ...
##  $ Phone                : chr  "(360)965-0400" "(360)604-6700" "(36"..
##  $ Locale.Code          : Factor w/ 12 levels "11","12","13",..: 5 2..
##  $ LocaleType           : Factor w/ 4 levels "City","Rural",..: 3 1 ..
##  $ LocaleSub            : Factor w/ 12 levels "City: Small",..: 5 2 ..
##  $ Charter              : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1..
##  $ Title.I.School       : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1..
##  $ Title.1.School.Wide  : Factor w/ 2 levels "No","Yes": 2 NA NA 2 N..
##  $ Student.Teacher.Ratio: num  23.4 8.4 21.5 15.9 6.5 15.3 NA 16.3 1..
##  $ Free.Lunch           : num  28 53 169 292 12 411 48 102 101 268 ...
##  $ Reduced.Lunch        : num  3 9 40 10 4 23 12 22 23 0 ...

The ones that say num are obviously numbers (numbers in R are numeric when decimal values are detected, and integer if they are not). The ones that say chr are strings, which are candidates to be key columns, which are not variables themselves, but identifiers of the cases. In this case, the first four are identifiers, as well as the the 7th, 10th and 15th columns (school names, address and phone, respectively). Those variables are not to be analyzed statistically, but may be used for annotating (7th and 15th column) or for geocoding (10th column). Notice that for these data, State is not to be analyzed as it is a constant (all rows are from WA); but it would be if the data were from the whole USA. Then, you see several variables identified as factor or ordered factor, which are categorical variables: they can be analyzed statistically but not in the same way that numbers.

Go to table of contents.

Data Processes

Data is obtained via different processes. When you observe reality, you can classsify, count or measure. Each of these decisions produces data with some basic characteristics; which are represented via categories or numerical values.

Classification

Categorical data are the output of the classification process. The classification can propose an incremental or non-incremental differentiation. The former are named ordinal data and the latter nominal data. A nominal classification related to education can be type of school funding: public or private; while an ordinal one can be: elementary, middle, high, college and graduate school level.

1. Visualization for nominal scales

Let’s see some raw values in the variable LocaleType:

head(eduwa$LocaleType,50) #first fifty values
##  [1] Suburb City   City   Town   City   City   Suburb City   Rural  Rural 
## [11] City   City   City   City   Suburb Rural  Rural  Rural  Suburb City  
## [21] Suburb Suburb Suburb Suburb City   City   Suburb Suburb Rural  Rural 
## [31] Rural  Suburb City   City   Rural  City   City   Rural  Rural  City  
## [41] Town   Town   Rural  City   Suburb City   City   Rural  City   City  
## Levels: City Rural Suburb Town

You can not get a clear idea of what a data table has, so a simple frequency table is the first tool to see what these nominal data are telling us:

# absolute values
table(eduwa$LocaleType,exclude = 'nothing')
## 
##   City  Rural Suburb   Town   <NA> 
##    714    505    798    338     72
# relative values
absoluteT=table(eduwa$LocaleType,exclude = 'nothing')
prop.table(absoluteT)
## 
##       City      Rural     Suburb       Town       <NA> 
## 0.29419036 0.20807581 0.32880099 0.13926658 0.02966625

This table tells us the location of the public schools. What is the right visual for this? Sometimes the answer seems obvious, as tradition or habits give so much weight to decisions. Let’s use the very well known pie chart:

# the pie plots the table:
ToPlot=prop.table(absoluteT)
pie(ToPlot)

You should always keep it simple. Then decorate. For example, you can start improving the plot you already have:

  • The purple sector does not show a label:
names(ToPlot)
## [1] "City"   "Rural"  "Suburb" "Town"   NA

We could alter the fifth label:

names(ToPlot)[5]='Unknown'
  • Our plot did not have a title. Titles (and subtitles) are important. To give a title, it can be a question to be answered by the plot:
# the pie plots the table:
titleText='Where are Public Schools located in WA in 2019?'
sourceText='Source: US Department of Education'

pie(ToPlot,
    main=titleText,
    sub=sourceText)

The title can guide the reader to recognise the purpose of your plot:

# the pie plots the table:
titleText2='WA still has schools locations unknown \n (info from 2018)'

pie(ToPlot,
    main=titleText2,
    sub=sourceText)

The title can also suggest the decision:

# the pie plots the table:
titleText3='WA needs to fully categorize school locations\n(info from 2018)'

pie(ToPlot,
    main=titleText3,
    sub=sourceText)

Titles are no that easy to produce. You need to rewrite them many times, until you find a good combination of words that can be read in less than ten seconds. Is also good to keep in mind that you must never give your audience a cacophonous version (a tongue twister), and neither should you include adjectives.

A general rule for any plot is to make it reachable for the audience you are writing to (read this every day).

Let’s do more customization:

  • You can use the values as labels. If values between [0,1] represent shares, it is better to use a [0,100] scale (in %).
ToPlot*100
##      City     Rural    Suburb      Town   Unknown 
## 29.419036 20.807581 32.880099 13.926658  2.966625
  • You can customize the colors:
# details:
ToPlot=ToPlot*100 # preparing labels
paletteHere=rainbow(length(ToPlot)) # customizing set of colors

# plotting
pie(x=ToPlot,#table
    col = paletteHere, 
    labels = ToPlot,
    main=titleText,
    sub=sourceText)

The labels need better work:

paste0(round(ToPlot,2),'%')
## [1] "29.42%" "20.81%" "32.88%" "13.93%" "2.97%"

Then,

plotLabels=paste0(round(ToPlot,2),'%') # labels for the slices
# plotting
pie(x=ToPlot,#table
    col = paletteHere, 
    labels = plotLabels,
    main=titleText,
    sub=sourceText)

  • You may need to use legends, but considering its cluttering effects:
# plotting
pie(x=ToPlot,#table
    col = paletteHere, 
    labels = plotLabels,
    main=titleText,
    sub=sourceText)
#legend
legend(x="right", #where
       legend=names(ToPlot), #text
       fill = paletteHere) #symbols' colors

Most legends need customization:

# MANAGING THE LEGEND:

pie(x=ToPlot,#table
    col = paletteHere, 
    labels = plotLabels,
    main=titleText,
    sub=sourceText)
#legend
legend(x="right", #where
       legend=names(ToPlot), #text
       fill = paletteHere,
       bty = 'n', # no box
       cex = 0.5  # shrink
       ) #symbols' colors

Most people tend to use pie charts with categorical data, but this should not be the default option to visualize classification (see this discussion). Following the advice from the video in that post, we should turn our pie into a bar chart. Let me do it with the same info I used to build the pie:

# barplot plots the table too

barplot(ToPlot,
         col = paletteHere,
         main=titleText,
         sub=sourceText)

We saved some space as no legend was needed (the less ink the better visual). Speaking of saving, we can get rid of the colors (they were needed to differentiate the slices).

A very important thing to consider is that axes represent some unit of measurement, so make sure that unit is shown:

paletteHereNew=c('gray') # just one color
# plotting
barplot(ToPlot,
     col = paletteHereNew,
     border=NA, #no border
     main=titleText,
     sub=sourceText,
     ylim=c(0,50),
     ylab = '(in %)' # show unit
     )

If you consider annotating the plot, you can use the labels we used before:

# plotting
location=barplot(ToPlot,
     col = paletteHereNew,
     border=NA,
     main=titleText,
     sub=sourceText,
     ylim=c(0,50),
     ylab = '(in %)')

#annotating
text(x=location,y=ToPlot,labels=plotLabels,
     pos = 1,# if pos=3, text will be on top of bar
     cex = 0.8) 

#pos and cex where you want location.  

You may decide to change the orientation of the plot:

# plotting
location=barplot(ToPlot,
     col = paletteHereNew,
     border=NA,
     main=titleText,
     sub=sourceText,
     ylim=c(0,50),
     ylab = '(in %)',
     horiz = T) # ORIENTATION

#annotating
text(x=location,y=ToPlot,labels=plotLabels,
     pos = 1) # this is the position of the label

The problem above is that changing the orientation, changes the axes. Then, we need to do more work:

location=barplot(ToPlot,
         col = paletteHereNew,
         border=NA,
         main=titleText,
         sub=sourceText,
         xlim=c(0,50), #change to xlim
         xlab = '(in %)', #change to xlab
         horiz = T)

#annotating
text(x=ToPlot,y=location, #change of x and y
     labels=plotLabels,
     pos = 4)  # change position of the label

A little more work on the categories names:

location=barplot(ToPlot,
         col = paletteHereNew,
         border=NA,
         main=titleText,
         sub=sourceText,
         cex.names = 0.7, #shrink category names
         xlim=c(0,50), 
         xlab = '(in %)', 
         horiz = T)

#annotating
text(x=ToPlot,y=location,labels=plotLabels,pos = 4)  

We made the right changes, but some things do not look well. It would be better if:

  • The subtitle (source) and the label of the x-axis were not that close. A good step will be to have the subtitle as an element of its own, which allows, for instance, to decide its alignment and size:
location=barplot(ToPlot,
         col = paletteHereNew,
         border=NA,
         main=titleText, # no sub here!
         xlim=c(0,50), 
         cex.names = 0.5,
         xlab = '(in %)', 
         horiz = T)

# annotating
text(x=ToPlot,y=location,labels=plotLabels,pos = 4)  

# subtitle
title(sub=sourceText, 
      adj=0,#adj=1 aligns to rigth.
      cex.sub=0.7) #shrinking text

  • To have the label of the x-axis closer to the axis itself, we need to alter the graphical parameters:
# changing parameters
# (distanceOfUnit To plot, 
# distanceOfAxislabels to plot,
# distance ofTicks to plot)
# default is: mgp=c(3, 1, 0)

par(mgp=c(0.5,0.5,0)) 
#####

location=barplot(ToPlot,
         col = paletteHereNew,
         border=NA,
         main=titleText,
         xlim=c(0,50), 
         xlab = '(in %)',
         cex.names = 0.6,
         cex.lab=0.6, # shrinking label text
         horiz = T) 

text(x=ToPlot,y=location,labels=plotLabels,pos = 4) 

title(sub=sourceText, adj=0,cex.sub=0.7,
      line = 3) #push the text down

  • It is generally a good idea to add a reference line, which can represent an expected value or another relevant value. Since I have four different locations (not considering the missing ones), let me put a line to signal the 25% (uniform share among four locations):
titleText2='Are all locations getting a fair share of public schools in WA?'


par(mgp=c(1,0.5,0)) 
location=barplot(ToPlot,
         col = paletteHereNew,
         border=NA,
         main=titleText2,
         xlim=c(0,50), 
         cex.names = 0.6,
         cex.lab=0.6,
         xlab = '(in %)',
         horiz = T
         ) 

text(x=ToPlot,y=location,labels=plotLabels,pos = 4) 
title(sub=sourceText, adj=0,cex.sub=0.7,
      line = 3) 

# reference line
abline(v=25,#position vertical
       lty=3,#type
       lwd=3)#width

Again, adding another element requires adjusting what we had. What about writing your own axis-values and reducing the bar annotations:

par(mgp=c(1,0.5,0)) 
location=barplot(ToPlot,
         col = paletteHereNew,
         border=NA,
         main=titleText2,
         xlim=c(0,50), 
         xlab = '(in %)',
         cex.names=0.6,
         cex.lab=0.6,
         las=2,
         horiz = T,
         xaxt="n") # no x-axis, so I customize it below...

text(x=ToPlot,y=location,labels=plotLabels,pos = 4,cex = 0.7) 
title(sub=sourceText, adj=0,cex.sub=0.7,line = 3) 

#reference line
abline(v=25,lty=3,lwd=3)


# customizing tick values
newXvalues<-c(0,10,25,40,50) # you just want to show this on the axis
axis(side=1, 
     at=newXvalues, 
     labels = newXvalues,
     cex.axis=0.8)

So far, we have used the basic R capabilities for plotting.

There are alternative libraries, like ggplot2, that are also frequently used. However, it has a different approach, which allows to add layers that let you customize your plot. The classic approach for ggplot is:

  • Avoid missing values and prepare frequency table. We replaced the missing values (now they are ‘Unknown’). Here, you need to transform the table into a data frame:
tableFreq=as.data.frame(ToPlot)
names(tableFreq)=c("locale","pct")
#you have:
tableFreq
##    locale       pct
## 1    City 29.419036
## 2   Rural 20.807581
## 3  Suburb 32.880099
## 4    Town 13.926658
## 5 Unknown  2.966625
  • Call the library:
library(ggplot2)
  • Create the base object, which is not a plot, just informing the main variables:
#base GGPLOT2 starts with a "base", telling WHAT VARIABLES TO PLOT
base= ggplot(data = tableFreq, 
             aes(x = locale,
                 y = pct)) 
  • On top of the previous object, add the layer that produces the main plots (the next layers will add or customize elements in the plot):
plot1 = base + geom_bar(fill ="gray",
                        stat = 'identity') # y is just what it is!
plot1

  • We can now pay attention to the titles:

EXERCISE ANSWER

base=ggplot(data = tableFreq, 
             aes(reorder(x=locale,-pct),
                 y = pct))
plot1= base+ geom_bar(fill ="gray",
                        stat = 'identity') # y is just what it is!
plot1 

plot2 = plot1 + labs(title=titleText2,
                     x =NULL, 
                     y = NULL,
                     caption = sourceText)
plot2

  • Add the reference lines:
plot3 = plot2 + geom_hline(yintercept = 25, #where
                           linetype="dashed", 
                           size=1.5, #thickness
                           alpha=0.5) #transparency
plot3

  • Customize the axes:
library(scales)

# customize Y axis
plot4 = plot3 + scale_y_continuous(breaks=c(0,10, 25,40,50),
                                 limits = c(0, 50), # expand = c(0, 0),
                                 labels=scales::unit_format(suffix = '%')) 
plot4

  • Less ink and title/subtitle positions:
plot5 = plot4 + theme(panel.background = element_rect(fill = "white",
                                                    colour = "grey50"),
                    plot.caption = element_text(hjust = 0), # default was 1
                    plot.title = element_text(hjust = 0.5))
plot5

  • annotating the bars:
plot6 = plot5 + geom_text(aes(y = pct ,
                            label = paste0(round(pct,2), '%')),
                        vjust=1, # if flipping 'hjust'
                        size = 3)
# wanna flip the plot?
plot6 #+ coord_flip()

Bar plots are the default option for categorical variables. In general, you see the distribution of the classification, which allows you to identify concentration. For that reason, ordering the bars by height can be helpful:

###
ToPlotOrd=sort(ToPlot)
###

par(mgp=c(1,0.5,0)) # distance label, tickText,tick
location=barplot(ToPlotOrd,
         col = paletteHereNew,
         border=NA,
         main=titleText2,
         xlim=c(0,50), 
         xlab = '(in %)',
         horiz = T,
         cex.names = 0.7,
         cex.lab=0.6,
         xaxt="n") # no x-axis, so I customize it below...

text(x=ToPlotOrd,y=location,labels=plotLabels,pos = 2,cex = 0.7) 
title(sub=sourceText, adj=0,cex.sub=0.7,line = 3) 

# reference line
abline(v=25,lty=3,lwd=3)

# customizong tick values
xtick<-c(0,10,25,40,50)
axis(side=1, at=xtick, labels = xtick,cex.axis=0.8)

The plot above simply change the order of the table. If you want to do the same with ggplot you should try the command:

tableFreq[order(-tableFreq$pct),]
##    locale       pct
## 3  Suburb 32.880099
## 1    City 29.419036
## 2   Rural 20.807581
## 4    Town 13.926658
## 5 Unknown  2.966625

Exercise:
Use ggplot to show a bar plot ordered by share size.

We could use our reference line to show gaps or differences. In this case, the Lollipop plot may be useful. This one is just a replacement for a bar plot:

base = ggplot(tableFreq, aes(x=locale,pct)) 
lolliplot1=base + geom_segment(aes(y = 0, 
                                   x = locale, 
                                   yend = pct, 
                                   xend = locale), color = "grey50") 
lolliplot1 + geom_point()

And, if you order the data frame:

tableFreq[order(tableFreq$pct),]
##    locale       pct
## 5 Unknown  2.966625
## 4    Town 13.926658
## 2   Rural 20.807581
## 1    City 29.419036
## 3  Suburb 32.880099

You can get:

# reordering DF steps:
tableFreqO=tableFreq[order(tableFreq$pct),]


base = ggplot(tableFreqO, aes(locale,pct)) 
lolliplot1=base + geom_segment(aes(y = 0, 
                                   x = locale, 
                                   yend = pct, 
                                   xend = locale), color = "gray") 
lolliplot2 = lolliplot1 + geom_point()
lolliplot2 + scale_x_discrete(limits=tableFreqO$locale) # key element

And, what about changing the axis values so that we can identify the gaps:

# new variable
tableFreqO$gap=tableFreqO$pct-25

# plot the new variable
base = ggplot(tableFreqO, aes(locale,gap)) 

lolliplot1=base + geom_segment(aes(y = 0, 
                                   x = locale, 
                                   yend = gap, 
                                   xend = locale), color = "gray") 
lolliplot2 = lolliplot1 + geom_point()
lolliplot2 + scale_x_discrete(limits=tableFreqO$locale) # key element

Maybe add some color:

# a new column for color
tableFreqO$PositiveGap=ifelse(tableFreqO$gap>0,T,F)

# add new aesthetics 'color'
base = ggplot(tableFreqO, aes(locale,gap,
                              color=PositiveGap)) #change
lolliplot1=base + geom_segment(aes(y = 0, 
                                   x = locale, 
                                   yend = gap, 
                                   xend = locale), color = "gray") 
lolliplot2 = lolliplot1 + geom_point()
lolliplot2 + scale_x_discrete(limits=tableFreqO$locale) # key element

Maybe add some extra info:

# a new column for color
tableFreqO$PositiveGap=ifelse(tableFreqO$gap>0,T,F)

base = ggplot(tableFreqO, aes(locale,gap,color=PositiveGap,
                              label = round(gap,3))) #  change
lolliplot1=base + geom_segment(aes(y = 0, 
                                   x = locale, 
                                   yend = gap, 
                                   xend = locale), color = "gray") 
lolliplot2=lolliplot1 + geom_point() 
lolliplot3= lolliplot2 + scale_x_discrete(limits=tableFreqO$locale) 
# annotating and moving the text on the horizontal
lolliplot3 + geom_text(nudge_x=0.3) 

You can avoid the overlaping symbols in the legend by using:

lolliplot3 + geom_text(nudge_x=0.3,show.legend = FALSE) 

Exercise:
Complete adding the elements missing in the last plot.

Go to table of contents.

2. Visualization for ordinal scales

For this section, we will use the variable that tells us the highest grade offered in a school. A simple exploration gives:

table(eduwa$High.Grade,exclude = 'nothing')
## 
##  PK  KG   1   2   3   4   5   6   7   8   9  10  11  12  13 
##  82   7   6  16  19  45 755 266  11 427  15   7   5 757   9

Being a categorical variable, the default option is again the bar plot. So let’s prepare the frequency table as a data frame:

frqTabO=as.data.frame(prop.table(table(eduwa$High.Grade)))
names(frqTabO)=c('grade','pct')
frqTabO
##    grade         pct
## 1     PK 0.033786568
## 2     KG 0.002884219
## 3      1 0.002472188
## 4      2 0.006592501
## 5      3 0.007828595
## 6      4 0.018541409
## 7      5 0.311083642
## 8      6 0.109600330
## 9      7 0.004532344
## 10     8 0.175937371
## 11     9 0.006180470
## 12    10 0.002884219
## 13    11 0.002060157
## 14    12 0.311907705
## 15    13 0.003708282

Now, we can use ggplot:

base = ggplot(frqTabO,aes(x=grade,y=pct))
base + geom_bar(stat = 'identity') 

The x-values in this variable have order. That is, there is an increasing level in the values. Whenever we have an ordering, besides concentration we can visualize symmetry: if there is bias towards lower or higher values.

Bar plots help you see concentration and symmetry, but we have an alternative way to clearly detect symmetry, via boxplots:

# boxplots do not use frequency tables

# as.numeric produces turns levels of the factor into numbers
box1 = ggplot(eduwa, aes(y=as.numeric(High.Grade))) 
box1 = box1 + geom_boxplot() + coord_flip() # to show it horizontally

box1

You have symmetry when the distance of those whiskers to the box is the same, and when the thick line is in the middle of the box. You can see that the values show a negative asymmetry (tail to the left).

Box plots expect a numeric value as an input, but we have an ordered categorical, so we used the as.numeric() function. However, that eliminated the levels we saw in the previous bar plot; we can put the levels back in our plot:

# the labels use the original ordinal levels
ordLabels= levels(eduwa$High.Grade)

box2 = box1 + scale_y_continuous(labels=ordLabels,breaks=1:15)
box2

Box plots have important statistical information. The beginning and the ending of the box indicates the first (q1) and the third quantile (q75); and the thicker line in the middle represents the median. All those values are clearly visible, but we can retrieve the values like this:

#get positions
# using 'ggplot_build'
pos_q1=     ggplot_build(box2)$data[[1]]$lower
pos_median= ggplot_build(box2)$data[[1]]$middle
pos_q3=     ggplot_build(box2)$data[[1]]$upper

# using
levels(eduwa$High.Grade)[c(pos_q1,pos_median,pos_q3)]
## [1] "5"  "8"  "12"

From the information retrieved, we know:

  • 25% of the public Schools offer at most 5th GRADE.
  • 50% of the public Schools offer at most 8th GRADE.
  • 75% of the public Schools offer at most 12th GRADE. Also, 25% of the schools offer at least 12th grade.

We can find these results with a detailed frequency table; that is, instead of using the command table as we did before, we could try a more advanced function:

library(summarytools)
freq(eduwa$High.Grade,style = 'rmarkdown')

Frequencies

Variable: eduwa$High.Grade
Type: Factor (ordered)

  Freq % Valid % Valid Cum. % Total % Total Cum.
PK 82 3.38 3.38 3.38 3.38
KG 7 0.29 3.67 0.29 3.67
1 6 0.25 3.91 0.25 3.91
2 16 0.66 4.57 0.66 4.57
3 19 0.78 5.36 0.78 5.36
4 45 1.85 7.21 1.85 7.21
5 755 31.11 38.32 31.11 38.32
6 266 10.96 49.28 10.96 49.28
7 11 0.45 49.73 0.45 49.73
8 427 17.59 67.33 17.59 67.33
9 15 0.62 67.94 0.62 67.94
10 7 0.29 68.23 0.29 68.23
11 5 0.21 68.44 0.21 68.44
12 757 31.19 99.63 31.19 99.63
13 9 0.37 100.00 0.37 100.00
<NA> 0 0.00 100.00
Total 2427 100.00 100.00 100.00 100.00

Exercise:
Make sure our box plot follows the same design approach and include all the elements as in the bar plot for nominal data.

Go to table of contents.

Counting

Counting expresses numerical values. They could be represented with bar plots if their frequency table had few different values. For example, the variable Reduced.Lunch informs how many kids there are in each school that have that lunch for a reduced price.

# how many unique values
length(unique(eduwa$Reduced.Lunch))
## [1] 172

There are too many different values. Then, the bar plot is not a good idea (and neither a frequency table):

barplot(table(eduwa$Reduced.Lunch),las=2,cex.names = 0.3,
        main='bad idea')

On the other hand, when we have a numerical variable, there are more statistical values that help understand its behavior:

# median close to mean?
# median and mean far from max or min?
# q1 distance to min is similar ti q3 distance to max?
# how many missing?

summary(eduwa$Reduced.Lunch)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.00   25.50   33.53   47.00  301.00     131

The bar plot produces a bar for each unique value in the data, counting how many times this value appeared. Now, we have many values, so we need to organize the data into intervals. The histogram is the basic plot when intervals are needed, you can use the basic function:

eduwa3=eduwa[complete.cases(eduwa$Reduced.Lunch),]
dataHist=hist(eduwa3$Reduced.Lunch) #saving info in dataHist

The width of each bin (bar) represents an interval of values, while its height the frequency. The histogram shows an asymmetric shape, where the bin with lowest values of the variable (between 0 and 20) are the most common (above 1000).

Of course, ggplot has a version of histograms:

base= ggplot(eduwa3,aes(x = Reduced.Lunch))  
h1= base + geom_histogram()
h1 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice that you do not get the same plot. Let’s see the info from the basic function:

dataHist
## $breaks
##  [1]   0  20  40  60  80 100 120 140 160 180 200 220 240 260 280 300 320
## 
## $counts
##  [1] 1011  549  371  166   72   49   30   18   12    8    2    3    1    2
## [15]    1    1
## 
## $density
##  [1] 2.201655e-02 1.195557e-02 8.079268e-03 3.614983e-03 1.567944e-03
##  [6] 1.067073e-03 6.533101e-04 3.919861e-04 2.613240e-04 1.742160e-04
## [11] 4.355401e-05 6.533101e-05 2.177700e-05 4.355401e-05 2.177700e-05
## [16] 2.177700e-05
## 
## $mids
##  [1]  10  30  50  70  90 110 130 150 170 190 210 230 250 270 290 310
## 
## $xname
## [1] "eduwa3$Reduced.Lunch"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

And now see the info that was used in ggplot:

ggplot_build(h1)$data[[1]]
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
##      y count         x       xmin       xmax      density      ncount
## 1  590   590   0.00000  -5.189655   5.189655 2.475778e-02 1.000000000
## 2  287   287  10.37931   5.189655  15.568966 1.204319e-02 0.486440678
## 3  271   271  20.75862  15.568966  25.948276 1.137179e-02 0.459322034
## 4  317   317  31.13793  25.948276  36.327586 1.330206e-02 0.537288136
## 5  247   247  41.51724  36.327586  46.706897 1.036470e-02 0.418644068
## 6  183   183  51.89655  46.706897  57.086207 7.679107e-03 0.310169492
## 7  103   103  62.27586  57.086207  67.465517 4.322120e-03 0.174576271
## 8   83    83  72.65517  67.465517  77.844828 3.482874e-03 0.140677966
## 9   54    54  83.03448  77.844828  88.224138 2.265966e-03 0.091525424
## 10  32    32  93.41379  88.224138  98.603448 1.342795e-03 0.054237288
## 11  25    25 103.79310  98.603448 108.982759 1.049058e-03 0.042372881
## 12  23    23 114.17241 108.982759 119.362069 9.651336e-04 0.038983051
## 13  20    20 124.55172 119.362069 129.741379 8.392466e-04 0.033898305
## 14  13    13 134.93103 129.741379 140.120690 5.455103e-04 0.022033898
## 15  11    11 145.31034 140.120690 150.500000 4.615857e-04 0.018644068
## 16   7     7 155.68966 150.500000 160.879310 2.937363e-04 0.011864407
## 17   5     5 166.06897 160.879310 171.258621 2.098117e-04 0.008474576
## 18   7     7 176.44828 171.258621 181.637931 2.937363e-04 0.011864407
## 19   3     3 186.82759 181.637931 192.017241 1.258870e-04 0.005084746
## 20   6     6 197.20690 192.017241 202.396552 2.517740e-04 0.010169492
## 21   1     1 207.58621 202.396552 212.775862 4.196233e-05 0.001694915
## 22   2     2 217.96552 212.775862 223.155172 8.392466e-05 0.003389831
## 23   1     1 228.34483 223.155172 233.534483 4.196233e-05 0.001694915
## 24   0     0 238.72414 233.534483 243.913793 0.000000e+00 0.000000000
## 25   1     1 249.10345 243.913793 254.293103 4.196233e-05 0.001694915
## 26   0     0 259.48276 254.293103 264.672414 0.000000e+00 0.000000000
## 27   1     1 269.86207 264.672414 275.051724 4.196233e-05 0.001694915
## 28   2     2 280.24138 275.051724 285.431034 8.392466e-05 0.003389831
## 29   0     0 290.62069 285.431034 295.810345 0.000000e+00 0.000000000
## 30   1     1 301.00000 295.810345 306.189655 4.196233e-05 0.001694915
##       ndensity PANEL group ymin ymax colour   fill size linetype alpha
## 1  1.000000000     1    -1    0  590     NA grey35  0.5        1    NA
## 2  0.486440678     1    -1    0  287     NA grey35  0.5        1    NA
## 3  0.459322034     1    -1    0  271     NA grey35  0.5        1    NA
## 4  0.537288136     1    -1    0  317     NA grey35  0.5        1    NA
## 5  0.418644068     1    -1    0  247     NA grey35  0.5        1    NA
## 6  0.310169492     1    -1    0  183     NA grey35  0.5        1    NA
## 7  0.174576271     1    -1    0  103     NA grey35  0.5        1    NA
## 8  0.140677966     1    -1    0   83     NA grey35  0.5        1    NA
## 9  0.091525424     1    -1    0   54     NA grey35  0.5        1    NA
## 10 0.054237288     1    -1    0   32     NA grey35  0.5        1    NA
## 11 0.042372881     1    -1    0   25     NA grey35  0.5        1    NA
## 12 0.038983051     1    -1    0   23     NA grey35  0.5        1    NA
## 13 0.033898305     1    -1    0   20     NA grey35  0.5        1    NA
## 14 0.022033898     1    -1    0   13     NA grey35  0.5        1    NA
## 15 0.018644068     1    -1    0   11     NA grey35  0.5        1    NA
## 16 0.011864407     1    -1    0    7     NA grey35  0.5        1    NA
## 17 0.008474576     1    -1    0    5     NA grey35  0.5        1    NA
## 18 0.011864407     1    -1    0    7     NA grey35  0.5        1    NA
## 19 0.005084746     1    -1    0    3     NA grey35  0.5        1    NA
## 20 0.010169492     1    -1    0    6     NA grey35  0.5        1    NA
## 21 0.001694915     1    -1    0    1     NA grey35  0.5        1    NA
## 22 0.003389831     1    -1    0    2     NA grey35  0.5        1    NA
## 23 0.001694915     1    -1    0    1     NA grey35  0.5        1    NA
## 24 0.000000000     1    -1    0    0     NA grey35  0.5        1    NA
## 25 0.001694915     1    -1    0    1     NA grey35  0.5        1    NA
## 26 0.000000000     1    -1    0    0     NA grey35  0.5        1    NA
## 27 0.001694915     1    -1    0    1     NA grey35  0.5        1    NA
## 28 0.003389831     1    -1    0    2     NA grey35  0.5        1    NA
## 29 0.000000000     1    -1    0    0     NA grey35  0.5        1    NA
## 30 0.001694915     1    -1    0    1     NA grey35  0.5        1    NA

The first ‘x’ was 0 in ggplot, while it was 10 (in $mids) in the base graphic; from there on everything changed. And not only that, you have 16 bins in the base graphic, while you got 30 in ggplot.

Of course, you can alter that in both alternatives.

Below, you can see a version where both plots are the same:

#ggplot
base= ggplot(eduwa3,aes(x = Reduced.Lunch))  
h1= base + geom_histogram(binwidth = 20,boundary=0) #changing width
h1= h1 + stat_bin(binwidth = 20, aes(label=..count..), 
                  geom = "text",boundary = 0,vjust=-0.5)
h1

# base
hist(eduwa3$Reduced.Lunch,labels = T,xlab="Reduced Lunch")

Of course, you can make it a litle better:

hist(eduwa3$Reduced.Lunch,labels = T,xlab="Reduced Lunch", xaxt="n") 
axis(side=1, at=dataHist$breaks) # showing axis labels better

As mentioned before, we are plotting intervals, so the accompanying table can be built. For that, we first create the intervals into another variable:

eduwa3$redLunchOrd=cut(eduwa3$Reduced.Lunch,
                       breaks = dataHist$breaks,
                       include.lowest = T,
                       ordered_result = T)

And, as before, we use the freq function:

# no need to show count of NAs:
freq(eduwa3$redLunchOrd,style = 'rmarkdown',report.nas = F)

Frequencies

Variable: eduwa3$redLunchOrd
Type: Factor (ordered)

  Freq % % Cum.
[0,20] 1011 44.03 44.03
(20,40] 549 23.91 67.94
(40,60] 371 16.16 84.10
(60,80] 166 7.23 91.33
(80,100] 72 3.14 94.47
(100,120] 49 2.13 96.60
(120,140] 30 1.31 97.91
(140,160] 18 0.78 98.69
(160,180] 12 0.52 99.22
(180,200] 8 0.35 99.56
(200,220] 2 0.09 99.65
(220,240] 3 0.13 99.78
(240,260] 1 0.04 99.83
(260,280] 2 0.09 99.91
(280,300] 1 0.04 99.96
(300,320] 1 0.04 100.00
Total 2296 100.00 100.00

Exercise:
Make a histogram for the variable FREE LUNCH, and make sure it has all the right elements, and get rid of unnecessary elements.

Go to table of contents.

Measurement

A simplistic idea of measurement tells you the times a particular unit is present in the unit of analysis; which allows for the presence of decimal places. There are variables that can have negative values.

Let’s analyze the variable Student.Teacher.Ratio, but organized by county:

# tapply(variable,group,functionToApply)
tapply(eduwa$Student.Teacher.Ratio, eduwa$County, mean)
##        Adams       Asotin       Benton       Chelan      Clallam 
##           NA           NA           NA           NA           NA 
##        Clark     Columbia      Cowlitz      Douglas        Ferry 
##           NA           NA           NA           NA           NA 
##     Franklin     Garfield        Grant Grays Harbor       Island 
##           NA     17.35000           NA           NA           NA 
##    Jefferson         King       Kitsap     Kittitas    Klickitat 
##           NA           NA           NA           NA           NA 
##        Lewis      Lincoln        Mason     Okanogan      Pacific 
##           NA     11.56000           NA           NA           NA 
## Pend Oreille       Pierce     San Juan       Skagit     Skamania 
##     15.47778           NA           NA           NA     16.37000 
##    Snohomish      Spokane      Stevens     Thurston    Wahkiakum 
##           NA           NA           NA           NA     18.15000 
##  Walla Walla      Whatcom      Whitman       Yakima 
##           NA           NA           NA           NA

Above, I tried to compute the mean for each county, but the function mean() outputs a missing value (NA) as the result when there is one NA in the column:

# strategy 1: remove missing before computing function: na.rm=T
tapply(eduwa$Student.Teacher.Ratio, eduwa$County, mean,na.rm=T)
##        Adams       Asotin       Benton       Chelan      Clallam 
##     14.80769     19.11111     20.38947     18.65000     19.27600 
##        Clark     Columbia      Cowlitz      Douglas        Ferry 
##     19.19744     11.30000     20.38974     16.47222     16.82727 
##     Franklin     Garfield        Grant Grays Harbor       Island 
##     18.10323     17.35000     17.52449     16.33846     19.53000 
##    Jefferson         King       Kitsap     Kittitas    Klickitat 
##     16.57857     18.80203     19.24030     18.93529     14.82632 
##        Lewis      Lincoln        Mason     Okanogan      Pacific 
##     18.93902     11.56000     16.76316     19.73226     22.39375 
## Pend Oreille       Pierce     San Juan       Skagit     Skamania 
##     15.47778     18.77931     15.90000     22.47209     16.37000 
##    Snohomish      Spokane      Stevens     Thurston    Wahkiakum 
##     20.49239     18.13542     19.61765     19.29231     18.15000 
##  Walla Walla      Whatcom      Whitman       Yakima 
##     15.55652     23.77213     13.27083     19.97500

Of course, you can clean first:

# strategy 2: 
eduwa4=eduwa[complete.cases(eduwa$Student.Teacher.Ratio),]

tapply(eduwa4$Student.Teacher.Ratio, 
       eduwa4$County, 
       mean)
##        Adams       Asotin       Benton       Chelan      Clallam 
##     14.80769     19.11111     20.38947     18.65000     19.27600 
##        Clark     Columbia      Cowlitz      Douglas        Ferry 
##     19.19744     11.30000     20.38974     16.47222     16.82727 
##     Franklin     Garfield        Grant Grays Harbor       Island 
##     18.10323     17.35000     17.52449     16.33846     19.53000 
##    Jefferson         King       Kitsap     Kittitas    Klickitat 
##     16.57857     18.80203     19.24030     18.93529     14.82632 
##        Lewis      Lincoln        Mason     Okanogan      Pacific 
##     18.93902     11.56000     16.76316     19.73226     22.39375 
## Pend Oreille       Pierce     San Juan       Skagit     Skamania 
##     15.47778     18.77931     15.90000     22.47209     16.37000 
##    Snohomish      Spokane      Stevens     Thurston    Wahkiakum 
##     20.49239     18.13542     19.61765     19.29231     18.15000 
##  Walla Walla      Whatcom      Whitman       Yakima 
##     15.55652     23.77213     13.27083     19.97500

Great!

Now let me plot a histogram of those means:

# keeping strategy 2: 
meanValues=tapply(eduwa4$Student.Teacher.Ratio, 
                  eduwa4$County, 
                  mean)
hist(meanValues)

Let’s compute some statistics:

summary(meanValues)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.30   16.42   18.65   17.96   19.41   23.77

You can use that info, for example, to plot the mean as a reference line:

#reference line
hist(meanValues)
abline(v=mean(meanValues),lty=3,lwd=3,col='blue')

Measurements are continuous values, then a density plot is more appealing to its nature:

mvDense=density(meanValues)

plot(mvDense,main="Title",col='black',xlab=NA)

abline(v=mean(meanValues),lty=3,lwd=3,col='blue') #mean
abline(v=median(meanValues),lty=3,lwd=3,col='red')#median
legend(x="right",
       legend=c('mean','median'),
       fill = c('blue','red'),bty = 'n') #no box in the legend

A box plot is always welcome, specially considering that it does not need reference lines. Take a look:

bp=boxplot(meanValues,horizontal = T,ylim=c(5,30))

Our plots for the mean values have a more symmetrical shape. This happens when you get mean values of groups, showing a tendency towards a bell-shaped distribution, which is ideally known as the Gauss or Normal distribution.

Notice also that boxplots serve to detect atypical values (outliers), which I saved in bp:

bp$out
## Columbia  Lincoln 
##    11.30    11.56

We could annotate the boxplot like this:

boxplot(meanValues,horizontal = T,ylim=c(5,30))
text(x= 10, y= 0.8, labels= "Outliers are:",col='gray')
text(x= 10, y= 0.75, 
     labels= paste(names(bp$out)[1], 'and', names(bp$out)[2]),
     col='gray')

In general, measurements and counts are prone to have outliers. It is not common to speak about outliers in categorical data since they have few levels; however, if they had many levels, we could find outliers if the variable is ordinal.

From what I said above, the subjective side of finding outliers lies in the decision of what is normal. In the case of the boxplot, the decision has been to accept as normal the values that have a prudent distance from the first or last quartile. This distance is 1.5 times the difference between the quartiles (a.k.a. Interquartle Range or IQR). Then, if a outlier is found, the whisker ends in a position different than the actual minimum or maximal value of the data.

Exercise:
Do some research and make a histogram and a density plot using ggplot for the variable we just used above.


Go to table of contents.

Back to course schedule menu